SUMMARY

Text searching programs such as the UNIX system tools grep and egrep require more than just
good algorithms; they need to make efficient use of system resources such as I/O. I describe
improving the I/O management in grep and egrep by using a new fast I/O library, fio, to replace the
normal I/O library stdio. I also describe incorporating the Boyer-Moore algorithm into egrep; egrep
is now typically 8-10 (for some common patterns 30-40) times faster than grep.

INTRODUCTION

In the beginning, Ken Thompson wrote the searching tool grep. It took a regular
expression and file names as arguments and printed from those files lines that
matched the regular expression. In 1975, just after the release of the Sixth Edition
of the UNIX system, Al Aho decided to put theory about finite state automata into
practice and wrote egrep over a weekend. Egrep supported full regular expressions
(including alternation and grouping, which were missing from grep) and used a
deterministic finite automaton rather than grep's nondeterministic finite automaton.
Egrep was about twice as fast as grep for simple character searches but was slower
for complex search patterns due to the high cost of building the state machine that
recognized the patterns. Fgrep, specialized for the case of many alternate literal
strings, was written in the same weekend.

Ever since, each of the tools has sporadically improved its performance, mostly as
a friendly rivalry between the owners of grep (Thompson, and later on, Lee
McMahon) and egrep (Aho). I inadvertently joined this rivalry in mid 1986 by
improving grep's I/O management, thus enabling grep to leapfrog in front of egrep
in the common case of simple patterns. During August 1987 I improved the
performance of grep by another factor of 2 and that of egrep by a factor of 3. By
implementing the Boyer-Moore algorithm for literal strings within egrep, I improved
egrep's performance (for patterns containing literal strings) by another factor of 8.

The next two sections describe the specifics of tuning grep and egrep. Readers
interested solely in the bottom line can skip straight to the final section, which
compares the current state of the various greps and offers recommendations as to their
usage.

In statements of the form 'x is n times as fast as y', n is time(y)/time(x).
Unless otherwise qualified, time refers to user CPU time. All timing tests are
summarized in an appendix.


TUNING GREP 

The primary motivation for tuning grep was its slow performance on large (>
10MB) files. (In retrospect, the improvements described below are apparent even
for smaller files of 100KB or more.) The limited time available for tuning
precluded work on the pattern matching code so I concentrated on improving the I/O
management. In particular, I wanted to see the speed increase due to replacing
stdio routines by fio routines. The fio library [1] is a simple buffered I/O library
whose main goals are speed and portability. It is meant to supplant the venerable
standard I/O library stdio.

The timing examples search for patterns in a large file. The file was a list of 
13,931 file names (one per line) totaling 512,000 bytes. Because of my focus on I/O 
performance, I chose a pattern (beginning of line) that was cheap to match. Thus 
the execution times and run profiles reflect as little of the matching code as possi- 
ble. The timing examples differ only in the amount of output generated. In each 
example, every line in the input is matched. The example generating a small 
amount of output (denoted ‘small’ hereafter) is 

    grep -c '^' datafile

The -c option means print a count of the matching lines. The example generating 
a large amount of output (denoted ‘large’ hereafter) is 

    grep '^' datafile

which copies every input line to the output. 

The structure of grep is fairly simple. The main program loops over its file argu- 
ments processing each one in turn. The processing of each file is also fairly simple; 
the file is opened, each line is read, the pattern is matched against the line and the 
appropriate action taken on that line. Most often, the action is to print the line if 
the pattern matched the line. Thus the processing can be represented as 

    while (gets(buf) != NULL)
        if (match(pattern, buf))
            printf("%s\n", buf);

The execution profile for the original grep (denoted here as grep.1) is

    grep.1
            small (12.5s)                     large (25.5s)
    % time   time   routine           % time   time   routine
      79.1   9.9s   gets                43.7  11.2s   printf
      14.8   1.8s   match               38.9   9.9s   gets
       5.7   0.7s   read                 8.7   2.2s   match
                                         5.7   1.5s   write
                                         2.7   0.7s   read


The numbers following ‘small’ and ‘large’ are the user execution times (on a VAX
11/750) in seconds. The routines read and write are system calls. The routine
match does the pattern matching. The other routines are described below.

The first candidate for tuning is gets, which reads one line into the buffer buf. 
(We always have to read all the input, regardless of how much output is generated.) 
The corresponding fio routine is Frdline, which returns a pointer to the line rather
than copying the line into a buffer. After changing the appropriate file open and 
file close calls, and changing buf from a character array to a character pointer, we 
obtain grep.2 which has this skeleton: 

    while (buf = Frdline(fd))
        if (match(pattern, buf))
            printf("%s\n", buf);

The new execution profile is

    grep.2
            small (4.2s)                      large (16.5s)
    % time   time   routine           % time   time   routine
      41.0   1.7s   match               63.0  10.4s   printf
      40.9   1.7s   Frdline             12.8   2.1s   match
      17.1   0.7s   read                11.0   1.8s   Frdline
                                         7.9   1.3s   write
                                         4.9   0.8s   read
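The difference between copying each line into a caller's buffer (as gets does) and handing back a pointer into the I/O buffer (as the text says Frdline does) can be sketched as follows. This is only an illustration over an in-memory buffer, not the fio implementation: the names LineSrc, src_init and src_rdline are hypothetical, and the sketch assumes the input ends with a newline.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a pointer-returning line reader.
   This is NOT the fio API; it only illustrates the zero-copy idea. */
typedef struct {
    char  *pos;      /* start of the next unread line */
    char  *end;      /* one past the last valid byte */
    size_t linelen;  /* length of the line most recently returned */
} LineSrc;

static void src_init(LineSrc *s, char *data, size_t n)
{
    s->pos = data;
    s->end = data + n;
    s->linelen = 0;
}

/* Return a pointer to the next line inside the buffer itself, or NULL at
   end of input. The newline is overwritten with '\0'; no bytes are copied. */
static char *src_rdline(LineSrc *s)
{
    char *line = s->pos;
    char *nl;

    if (line >= s->end)
        return NULL;
    nl = memchr(line, '\n', (size_t)(s->end - line));
    if (nl == NULL)
        return NULL;              /* sketch only: an unterminated tail is dropped */
    *nl = '\0';
    s->linelen = (size_t)(nl - line);  /* remembered so callers need no strlen */
    s->pos = nl + 1;
    return line;
}
```

A caller loops with `while ((buf = src_rdline(&s)) != NULL)` and can read the line length from `s.linelen`, which is the role the text later gives to FIOLINELEN.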


Most of the input time is now spent reading and breaking the input into lines. 
Frdline has been carefully optimized for this; it is hard to do better without a great 
deal of work. The next candidate for tuning is printf, which consumes most of the 
run-time. Now, all that this call of printf does is copy the line to standard output. 
There are two other ways to do this within stdio: puts, which prints a null- 
terminated line, or fwrite, which outputs a string of known length (the Frdline rou- 
tine stores the line length). In most versions of the UNIX system, fwrite is poorly 
implemented, so it is not surprising that grep.3, which uses fwrite, is about the same 
speed as grep.2, which uses printf. 
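The two stdio alternatives to printf mentioned above behave differently when the line length is already known. The sketch below uses fputs (the FILE-pointer relative of puts) so that the output stream can be chosen; the helper names emit_fputs and emit_fwrite are hypothetical and are not taken from grep.

```c
#include <stdio.h>

/* Write one line with fputs: output stops at the first '\0' in buf. */
static int emit_fputs(FILE *out, const char *buf)
{
    if (fputs(buf, out) == EOF)
        return -1;
    return fputc('\n', out) == EOF ? -1 : 0;
}

/* Write one line with fwrite and an explicit length: the length is not
   rediscovered by scanning, and embedded '\0' bytes are written as-is. */
static int emit_fwrite(FILE *out, const char *buf, size_t len)
{
    if (fwrite(buf, 1, len, out) != len)
        return -1;
    return fputc('\n', out) == EOF ? -1 : 0;
}
```

Given the five bytes `ab\0cd`, emit_fputs writes only `ab` and a newline, whereas emit_fwrite with len 5 writes all five bytes and a newline. Which of the two is faster depends on the stdio implementation, as the timings in the text show.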

Replacing the printf call by puts gives grep.4, which is 1.34 times as fast as grep.2.
The execution profile (only the large output case is relevant) now looks like this:

    grep.4
            large (12.3s)
    % time   time   routine
      54.6   6.7s   puts
      15.3   1.9s   match
      13.8   1.7s   Frdline
      10.0   1.2s   write
       5.9   0.7s   read


Eliminating stdio completely by replacing the puts by Fwrite changes the main 
loop to 

    while (buf = Frdline(fd))
        if (match(pattern, buf)) {
            Fwrite(1, buf, FIOLINELEN(fd));
            Fputc(1, '\n');
        }

The new program, grep.5, is 1.71 times as fast as grep.4. Another benefit of using
Fwrite (rather than puts) is that input lines containing NULLs are correctly handled
instead of being truncated by puts. The high overhead of procedure calls on the
VAX (where these experiments are being run) suggests eliminating the call to
Fputc:

    while (buf = Frdline(fd))
        if (match(pattern, buf)) {
            register x = FIOLINELEN(fd);

            buf[x] = '\n';    /* restore the \n Frdline took */
            Fwrite(1, buf, x+1);
        }

This program, grep.6, runs 1.08 times as fast as grep.5. The execution profile is

    grep.6
            large (6.7s)
    % time   time   routine
      30.2   2.0s   match
      22.6   1.5s   Frdline
      20.8   1.4s   Fwrite
      16.0   1.1s   write
       9.9   0.7s   read


For interest's sake, I timed a version of grep.6 that uses unbuffered I/O, that is,
using the system call write rather than Fwrite. The user time dropped 5 per cent to
6.3s but the time spent inside the operating system increased 11.64 times from
2.92s to 33.99s! Buffering, at least in our version of UNIX, is a necessary evil.

The payoff for using fio is high; grep runs 3.1-3.8 times as fast as the original
version. There is no way to continue tuning without attacking the pattern
matching code. (Fwrite has no fat in it!) That effort is reserved for egrep,
described below.



TUNING EGREP 

The original implementation of egrep was a deterministic state machine. Perfor- 
mance was often poor because of the potentially exponential time needed to con- 
struct the state machine. In 1983, Aho introduced cached lazy evaluation of the 
state transition tables by techniques described in Reference 2. In practice, as few 
state transitions are needed, the lazy algorithm runs nearly as fast as the original 
algorithm but with zero initialization time and just one additional test inside the 
inner loop. 
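Cached lazy evaluation can be illustrated with a far smaller automaton than the one egrep builds. The sketch below is a toy under stated assumptions and is not Aho's code: it recognizes a single literal pattern, state s means "the last s input bytes equal the first s pattern bytes", and the names LazyDfa, dfa_init, dfa_step and dfa_search are hypothetical. Each transition is computed the first time it is needed and then served from the table.

```c
#include <string.h>

#define NCHARS 256
#define MAXPAT 32       /* illustrative limit on pattern length */

typedef struct {
    const char *pat;
    int len;
    int next[MAXPAT + 1][NCHARS];  /* -1 marks a transition not yet computed */
    int computed;                  /* transitions built so far */
} LazyDfa;

static int dfa_init(LazyDfa *d, const char *pat)
{
    size_t n = strlen(pat);

    if (n > MAXPAT)
        return -1;
    d->pat = pat;
    d->len = (int)n;
    d->computed = 0;
    memset(d->next, -1, sizeof d->next);   /* all-ones bytes read back as -1 */
    return 0;
}

/* Slow path: the longest pattern prefix that is a suffix of pat[0..s) + c. */
static int dfa_build(const LazyDfa *d, int s, unsigned char c)
{
    int k;

    for (k = (s + 1 < d->len ? s + 1 : d->len); k > 0; k--) {
        if ((unsigned char)d->pat[k - 1] != c)
            continue;
        if (memcmp(d->pat, d->pat + s - (k - 1), (size_t)(k - 1)) == 0)
            return k;
    }
    return 0;
}

/* Fast path: one table lookup plus one test for "not computed yet". */
static int dfa_step(LazyDfa *d, int s, unsigned char c)
{
    int t = d->next[s][c];

    if (t < 0) {
        t = d->next[s][c] = dfa_build(d, s, c);
        d->computed++;
    }
    return t;
}

/* Return 1 if text contains the pattern, 0 otherwise. */
static int dfa_search(LazyDfa *d, const char *text)
{
    int s = 0;

    for (; *text; text++) {
        s = dfa_step(d, s, (unsigned char)*text);
        if (s == d->len)
            return 1;
    }
    return 0;
}
```

Only the (state, character) pairs that actually occur in the input are ever built, which is the property the text relies on: few transitions are needed in practice, so there is no up-front construction cost.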

Egrep was profiled on a per-instruction basis with lcomp [3] because the critical code
is written as a loop with only a few rarely used function calls and with hand-coded 
input routines. The resultant profiles are not shown because of their bulk; the 
appendix lists the execution times. As I tuned the pattern matching code for egrep 
as well as the I/O code, there is also a timing example denoted ‘matching’ that exe- 
cutes the pattern matching code for every byte of the input: egrep abed input 
(note that no input line contains abed). 

Given my experience with grep, I first examined I/O. Egrep.2 replaces the stdio
output routines in egrep.1 by the appropriate routines from fio. This produced a
similar result as for grep: no improvement for the small output case and 1.94 times
faster for the large output case.

Next, I changed the input code to use fio routines. The input code had used a
peculiar buffering scheme that complicated the state machine code considerably.
The conversion to fio was intended mainly to simplify the code and to measure how
much difference there was between the fio library and hand-coded I/O. The input
code used to be

    if (--ccount <= 0) {
        if (p <= &buf[BUFSIZ]) {
            if ((ccount = read(f, p, BUFSIZ)) <= 0) goto done;
        } else if (p == &buf[2*BUFSIZ]) {
            p = buf;
            if ((ccount = read(f, p, BUFSIZ)) <= 0) goto done;
        } else if ((ccount = read(f, p, &buf[2*BUFSIZ]-p)) <= 0)
            goto done;
    }


The fio version is 

    if (*p++ == '\n') {
        if ((p = Frdline(f)) == 0) goto done;
        len = FIOLINELEN(f);
        p[len++] = '\n';    /* restore the \n Frdline took */
    }

The fio routines run significantly faster because the old code does many small reads
as p gets near the end of the buffer. The resulting program (egrep.3) runs 1.57 and
1.34 times as fast as egrep.2 for the small and large cases respectively. Surprisingly,
egrep.3 is slightly slower for the matching case. The reason in retrospect is clear:
the pattern matching code has to examine each byte and finds the \n (end of line)
anyway. Thus, the work that Frdline does finding the \n is superfluous. The next
set of changes concentrated on the state machine code:

    register char *p;
    register t, cstat;

    if ((t = gotofn[cstat][c = *p&0377]) == 0)
        cstat = nxtst(cstat, c);
    else
        cstat = t;
    if (out[cstat]) { /* match */ }
    if (*p++ == '\n') { /* get new line */ }
The matrix gotofn encapsulates the state transition cache. The indexing of gotofn
seems clumsy and, before checking the generated code, I tried to improve it by
making p unsigned and doubling the array size to 256. This ensures that *p is
positive; some implementations of char are signed. The result is


    register unsigned char *p;
    register t, cstat;

    for (;;) {
        if ((t = gotofn[cstat][*p]) == 0)
            cstat = nxtst(cstat, *p);
        else
            cstat = t;
        if (out[cstat]) { /* match */ }
        if (*p++ == '\n') { /* get new line */ }
    }




The generated assembler code for the main loop is

    L1: ashl    $8,r10,r0
        addl2   $_gotofn,r0     # r0 = gotofn[cstat]
        cvtbl   (r11),r1        # r1 = *p
        addl2   r1,r0
        movb    (r0),r9         # t = gotofn[cstat][*p]
        jneq    L2
        # call nxtst
    L2: movzbl  r9,r10          # cstat = t
        tstb    _out(r10)
        jeql    L3
        # match
    L3: cmpb    (r11)+,$10      # *p++ == '\n'
        jneq    L1
        # get new line
        jbr     L1

    Instr    Meaning                            Instr    Meaning
    addl2    add op1 to op2                     cvtbl    convert (sign-extended) byte to long
    ashl     left shift op2 by op1 into op3     movzbl   convert (zero-extended) byte to long
    cmpb     compare op1 to op2 as bytes        tstb     compare op1 to zero as bytes

Five instructions are spent calculating the next state; obviously, the matrix indexing is
expensive. As we have to index by the input character, the best we can do is to use
a vector instead of a matrix. This means representing states as data with a
vector of transitions. The code for egrep.5 is:

    typedef struct State {
        struct State *gotofn[NCHARS];
        int out;
    } State;

    register unsigned char *p;
    register State *cstat, *t;

    for (;;) {
        if ((t = cstat->gotofn[*p]) == 0)
            cstat = nxtst(cstat, *p);
        else
            cstat = t;
        if (cstat->out) { /* match */ }
        if (*p++ == '\n') { /* get new line */ }
    }


This added four lines of code to egrep (the declaration lines). Yet the inner loop is
now only five instructions plus three branches and egrep.5 is 1.55 times as fast as
egrep.4 (this and following times refer to an example where pattern matching
predominates). Moreover, it is relatively easy for a compiler to generate good code
for this source. The corresponding assembler code is


    L1: cvtbl   (r11),r0        # r0 = *p
        movl    (r10)[r0],r9    # t = cstat->gotofn[*p]
        jneq    L2
        # call nxtst
    L2: movl    r9,r10          # cstat = t
        tstl    1024(r10)       # cstat->out == 0
        jeql    L3
        # match
    L3: cmpb    (r11)+,$10      # *p++ == '\n'
        jneq    L1
        # get new line
        jbr     L1




An instruction count execution profile reveals about 80 per cent of the time is spent 
executing these five instructions. Another 15 per cent is spent doing I/O. 

In the general case, each input character may complete the match of the regular
expression. Thus, as every character has to be examined, there seem to be only
two performance improvements possible. Firstly, decreasing the length of the
inner loop from five instructions to four. This is not possible on the VAX because
it does not support indexing by a byte operand (*p); other hardware may be able
to do the indexing in one instruction. Secondly, it is possible to put the test for \n
inside the match and thus move it out of the inner loop. Unfortunately, this
doubles the number of states and complicates new state generation.

Can we avoid looking at every character? The answer is yes, if we are looking
for literal strings. A literal string is a regular expression with no meta-characters;
that is, every character stands for itself. The Boyer-Moore algorithm performs
very fast searches for a literal string. It tries to match the pattern against the input
from right-to-left. Failures can often advance the input pointer by the length of the
search pattern; for example, if the search string is 10 characters long and does not
contain a y (say), and the current input character is a y, then we can advance over
10 input characters as the y cannot match any part of the pattern. In general, we
can advance the input pointer by the distance of the character from the end of the
pattern. If the input character matches, we then do a character-by-character check.
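The skipping idea above can be sketched with a single shift table indexed by the input character. What follows is the Horspool simplification of Boyer-Moore, chosen here for brevity; it is not the code in egrep, and the function name bm_search is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Return a pointer to the first occurrence of pat[0..m) in text[0..n),
   or NULL if there is none. Horspool variant: one shift table only. */
static const char *bm_search(const char *text, size_t n,
                             const char *pat, size_t m)
{
    size_t skip[256];
    size_t i;

    if (m == 0)
        return text;
    if (m > n)
        return NULL;

    /* A byte that does not occur in the pattern lets us skip m positions;
       otherwise skip by its distance from the end of the pattern. */
    for (i = 0; i < 256; i++)
        skip[i] = m;
    for (i = 0; i + 1 < m; i++)
        skip[(unsigned char)pat[i]] = m - 1 - i;

    i = m - 1;                    /* text index aligned with the last pattern byte */
    while (i < n) {
        unsigned char c = (unsigned char)text[i];

        /* Only when the last byte matches do we check the rest. */
        if (c == (unsigned char)pat[m - 1] &&
            memcmp(text + i - (m - 1), pat, m - 1) == 0)
            return text + i - (m - 1);
        i += skip[c];
    }
    return NULL;
}
```

Searching for `brown` in `the quick brown fox` inspects only the bytes at offsets 4, 9 and 14 before verifying the match at offset 10, which shows why most input bytes are never examined.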

Egrep.6 detects patterns that are literal strings and executes special purpose code
(about 164 lines of C, a 27 per cent increase of lines of code) achieving spectacular
results, 8.12 times as fast as egrep.5. The code implements the simple Boyer-Moore
search just using the delta0 table as recommended in Reference 5. This is not
guaranteed to run in linear time but does have a 4-instruction inner loop on the
VAX. (It is puzzling how long this took to come about; although the Boyer-Moore
algorithm was published in 1977, the first workable implementation of Boyer-Moore
within egrep was done by James A. Woods at NASA Ames Research Center in 1986.)
An unexpected consequence of the efficiency of Boyer-Moore is that asking egrep to
give the line number for lines that match slows egrep down by a factor of 2 because
it now has to look at every input byte to count the newlines.

What about regular expressions (rather than literal strings)? Boyer-Moore cannot
be directly applied to the regular expression matcher. However, it can filter out
lines that can't match by extracting the longest (in practice, the most effective for
Boyer-Moore) literal string in the regular expression and running the regular
expression matcher only on the lines that match. The literal string extraction runs
in time linear in the length of the pattern but does not handle common substrings
in alternations. Thus, it returns abc rather than defg for the pattern
abc.*(defg|xdefgy). The first version of Woods' egrep implemented this filter
literally: matching lines were written into a pipe feeding the normal egrep. Egrep.7
has a stripped down version of the state machine that matches against a string in
memory (29 lines, a 4 per cent increase) which is run on every line that matched
the literal string pattern. The times show the improvement over egrep.6. As
expected, the longer the literal string, the more effective the Boyer-Moore filtering
is. The production egrep uses Boyer-Moore only for literal strings of length two or
more.


    pattern      egrep.6   egrep.7   egrep.6/egrep.7
    n.             2.6s      3.6s        .72
    na.            9.8s      3.2s       3.2
    nan.          10.1s      1.9s       5.3
    name.         10.1s      1.7s       5.9
    name\..       10.1s      1.2s       8.4
    name\.c.      10.2s      1.1s       9.3


This improvement is not limited to the VAX; measurements on a Sun 3/180
workstation show similar speedups. In fact, searching for a random 30-character string
in the data file takes no measurable user time on the Sun. Indeed, for the typical
use where the pattern includes a literal string longer than 3-4 characters that
matches relatively few lines, run-time is determined primarily by how fast the
underlying operating system can do I/O.
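The literal-string filter described earlier in this section can be sketched in two stages: a cheap substring test rejects most lines, and the full matcher runs only on the survivors. In this sketch, strstr stands in for the Boyer-Moore search and the POSIX regcomp/regexec routines stand in for egrep's state machine; the Filter type and function names are hypothetical, and the literal is supplied by the caller rather than extracted from the pattern.

```c
#include <regex.h>
#include <string.h>

typedef struct {
    const char *literal;    /* a string every matching line must contain */
    regex_t     re;         /* the full pattern */
    long        full_runs;  /* how many lines reached the full matcher */
} Filter;

/* Compile the full pattern; returns 0 on success (as regcomp does). */
static int filter_init(Filter *f, const char *literal, const char *pattern)
{
    f->literal = literal;
    f->full_runs = 0;
    return regcomp(&f->re, pattern, REG_EXTENDED | REG_NOSUB);
}

/* Return 1 if line matches the full pattern, 0 otherwise. */
static int filter_match(Filter *f, const char *line)
{
    if (strstr(line, f->literal) == NULL)
        return 0;                           /* cheap rejection, no regex work */
    f->full_runs++;
    return regexec(&f->re, line, 0, NULL, 0) == 0;
}

static void filter_free(Filter *f)
{
    regfree(&f->re);
}
```

With the pattern `abc.*(defg|xdefgy)` and the literal `abc`, a line without `abc` never reaches regexec; the counter full_runs makes that visible.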

CONCLUSION

I have described speeding up two members of the grep family of pattern matching
programs, grep and egrep, by two methods. The most important method is using a
better algorithm, such as Boyer-Moore. This speeded up egrep by a factor of
8-10; I would expect a similar gain for grep. The second method is improving
I/O management by using the fio library: grep improved its user CPU time by a
factor of 3.1-3.8, egrep by 1.6-2.6. The source of the improvements is shown below;
segments represent the improvement made by introducing the label. The numbers
are times relative to the final version. For example, introducing fio input to the
large output case of grep divided the run-time by 2.2.


[Figure: horizontal bar charts of run-time relative to the final version, for the small and large output cases of grep and egrep. Segment labels: final, fio input, fio output, puts.]


Converting programs to use fio is straightforward and the payback is large provided
I/O is significant. Furthermore, fio provides a portable user interface (for example,
one that doesn't depend on the size of integers) and the implementation is simple and
[...].

In some circles, these changes to grep and egrep would be dismissed as 'simply
[...]'. Nevertheless, solving a problem like searching for lines in a file
intrinsically [...]; once the algorithms are
sufficiently good (and they are now!), I/O problems dominate.

Which of grep and egrep should you use? Provided your pattern is one both grep
and egrep support, egrep is generally much faster than grep. On the most common
patterns (literal strings), it is over an order of magnitude faster. On regular
expressions like d...name, egrep is typically 30-40 times as fast as grep.

APPENDIX: DETAILED TIMINGS

The input for these tests has been described above; it is 512,000 bytes with 13,931
lines. Times are user CPU time, given as averages of 10 runs executed on a
standalone VAX 11/750.

grep, large output case (times in seconds; ? = illegible in the source)

                   grep.1  grep.2  grep.3  grep.4  grep.5  grep.6
    gets              9.9
    Frdline                   1.8     1.7     1.7     1.5     1.5
    match             2.2     2.1     2.4     1.9     2.0     2.0
    read              0.7     0.8     0.7     0.7     0.7     0.7
    printf           11.2    10.4
    fwrite                           10.3
    puts                                      6.7
    Fputc                                             0.4
    Fwrite                                              ?     1.4
    write             1.5     1.3     1.3     1.2       ?     1.1
    total           25.54   16.47   16.48   12.30    7.21     6.7
    lines of code     476     478     479     478     480     48?

The following tests are the same as the above tests with an additional test to
measure matching efficiency: egrep abed input. No lines match this pattern.

                   egrep.1  egrep.2  egrep.3  egrep.4  egrep.5  egrep.6  egrep.7
    matching         15.95    15.03    16.55    15.49     9.99     1.23     1.28
    lines of code      644      647      616      615      619      783      812

The Active Deallocation of Objects in Object-Oriented Systems

SUMMARY

In object-oriented systems, it is often useful for objects to be allowed to carry out some action
before they are deallocated. This can be done by defining a destroy method in the object's class,
and arranging for the memory system to send a message invoking this method immediately
before deallocating the object. This allows resources associated with the object to be returned
to the system, limited cross-language garbage collection, and other, more complex, behaviour.
During the execution of the destroy method it is possible for new references to objects to be
created. Care must be taken that the garbage collection does not erroneously free such objects.
Algorithms are presented to implement destroy methods in systems using reference counting
and mark-scan garbage collection techniques. Properties that are desirable in such systems are
also discussed.

key words Object-oriented systems Deallocation Destroy method 


INTRODUCTION 

The execution of object-oriented systems centres around the exchange of messages
between the objects in the system. These objects are instances of abstract data types
and are created and destroyed as the program executes. Objects are usually created by
executing a special creation procedure in the object's class. Many languages require
the programmer to deallocate explicitly each object when it becomes unimportant. This
is especially so in languages, such as C++, in which object-oriented facilities have
been added to an existing language. Objects allocated on the stack are deallocated
automatically when they go out of scope. However, Smalltalk and LISP-based
object-oriented systems [3-5] use garbage collection [6] to control the deallocation of
objects, making the programmer's task easier and less prone to error.

Objects can be used to represent all kinds of data. Many correspond to aggregates
in other languages. Some present a resource to the rest of the system, such as an open
file or terminal, which is used by sending messages to the object. In this case, it may
be important that a specific action occurs when the resource is no longer needed. A
typical action would be to deallocate the resource, for example to close the file, but
could also involve other actions, such as flushing buffered output or rewinding a

† Current address: Department of Computer Science, University of York, Heslington, York YO1 5DD, U.K.